Team 1
04 January 2021
1. Examining Continuous Variables
2. Looking for Structure: Dependency Relationships and Associations
3. Investigating Multivariate Continuous Data
With no binwidth given, qplot falls back to its default of 30 bins. The resulting histogram shows a bimodal distribution with tails extending on both sides.
## Package 'mclust' version 5.4.7
## Type 'citation("mclust")' for citing this R package in publications.
library(ggplot2)
# galaxies ships with the MASS package
data(galaxies, package='MASS')
galaxies <- as.data.frame(galaxies)
names(galaxies) <- 'Velocity'
par(fig=c(0,1,0,1),new=T)
qplot(galaxies$Velocity) +
labs(title='Histogram of Galaxy Velocity',
x='Velocity of Galaxy',
y='Frequency')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
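The `stat_bin()` message above suggests trying other binwidths. A minimal sketch of that comparison in base R follows; the two binwidths (5000 and 1000) are illustrative assumptions, not values taken from this report, and the data are read via `MASS::galaxies` so the data frame built above is left untouched.

```r
# Sketch: compare a coarse and a fine binwidth for the galaxy velocities.
# The binwidths below are illustrative assumptions.
vel <- as.numeric(MASS::galaxies)   # numeric vector of velocities

bin_counts <- lapply(c(coarse = 5000, fine = 1000), function(bw) {
  # Breaks aligned to multiples of the binwidth and covering the full range
  brks <- seq(floor(min(vel) / bw) * bw, ceiling(max(vel) / bw) * bw, by = bw)
  hist(vel, breaks = brks, plot = FALSE)$counts
})

# Number of bins produced by each choice
sapply(bin_counts, length)
```

With ggplot2 the same idea would be expressed through the `binwidth` argument of the histogram layer.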
The density plot of the model shows three distinct superclusters with the far right tail not being as distinct.
library(mclust)
mod <- Mclust(galaxies$Velocity)
par(fig=c(0,1,0,1),new=T)
plot(mod,what="density")
To present all the information, I think we need at least 5 different plots to bring out everything the data set can show. Boxplots, histograms, rug plots and dot plots can each provide different information.
There are several different histogram forms, each telling a separate story. Default binwidths, dividing each variable’s range by 30, have been used. Other scalings could reveal more information and would be more interpretable. It is interesting that the vertical scales vary from maxima of 40 to over 400. When plotting histograms individually, choosing binwidths and scale limits are the main decisions to be taken.
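To make the "range divided by 30" default concrete, here is a small sketch. It assumes, purely for illustration, that the Boston housing data in MASS stand in for the data set discussed above; the hand-picked binwidth of 2 for `medv` is likewise an assumption.

```r
# Sketch: the binwidth implied by a 30-bin default is range / 30.
# Assumption: MASS::Boston stands in for the data set discussed above.
boston <- MASS::Boston
default_bw <- sapply(boston, function(x) diff(range(x)) / 30)
round(default_bw, 3)

# A rounder, hand-picked binwidth is often easier to read, e.g. for medv
# (median home value in $1000s); the value 2 is an illustrative choice.
medv_counts <- hist(boston$medv, breaks = seq(0, 52, by = 2), plot = FALSE)$counts
max(medv_counts)  # tallest bar, useful when fixing a common y-axis limit
```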
data(survey, package="MASS")
par(fig=c(0,1,0,1),new=T)
hist(survey$Height,
xlab = 'Height',
main = "Histogram of Students' Heights",
ylab = 'Frequency')
b) Examination of national survey data on young adults shows that the separation between the distributions of men’s and women’s heights is not wide enough to produce bimodality.
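The separation argument can be checked roughly on the student data used above. The sketch below relies on a rule of thumb that holds only under stated assumptions: a mixture of two equally weighted normal groups with a common standard deviation is bimodal only if the means differ by more than two standard deviations. The real groups are neither exactly normal nor equally sized, so this is an indication, not a test.

```r
# Sketch: rough separation check for male and female heights in MASS::survey.
# Rule of thumb (assumes equal weights and a common SD): bimodal only if
# the group means differ by more than 2 SDs.
sv <- MASS::survey
sv <- sv[!is.na(sv$Height) & !is.na(sv$Sex), ]

by_sex    <- split(sv$Height, sv$Sex)
grp_mean  <- sapply(by_sex, mean)
grp_sd    <- sapply(by_sex, sd)
pooled_sd <- sqrt(mean(grp_sd^2))   # simple pooled SD, ignores group sizes

separation <- abs(grp_mean[["Male"]] - grp_mean[["Female"]]) / pooled_sd
separation  # values above 2 would suggest bimodality under the assumptions
```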
library(ggplot2)
data(movies, package="ggplot2movies")
par(fig=c(0,1,0,1),new=T)
hist(movies$year[movies$length == 90 | movies$length == 7],
xlab = 'Year',
main = 'Release Years of Movies Lasting 7 or 90 Minutes',
ylab = 'Nr.')
a) The histogram shows that the peaks at lengths of 7 minutes and 90 minutes are present in both periods: before 1980 and after 1980.
library(lawstat)
data(zuni, package="lawstat")
par(fig=c(0,1,0,1),new=T)
hist(zuni$Revenue,
xlab = 'Revenue',
main = 'Revenue Histogram',
ylab = 'Nr.')
I prefer a histogram for showing the lowest and the highest 5% of the values.
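One way to make the lowest and highest 5% explicit is to cut the variable at its 5% and 95% quantiles. The sketch below uses a hypothetical helper, `flag_tails`, which is not part of lawstat or base R; it assumes the two quantiles are distinct.

```r
# Sketch: label the lowest and highest 5% of a numeric variable.
# `flag_tails` is a hypothetical helper introduced for illustration.
flag_tails <- function(x, p = 0.05) {
  cuts <- quantile(x, probs = c(p, 1 - p), na.rm = TRUE)
  cut(x, breaks = c(-Inf, cuts, Inf),
      labels = c("lowest", "middle", "highest"))
}

# Toy check on the integers 1 to 100
tail_table <- table(flag_tails(1:100))
tail_table
```

Applied to the data above, `table(flag_tails(zuni$Revenue))` would count the districts in each tail, and the labels could be used to colour the histogram bars.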
The CHAIN data set has no “h39b.W1” variable because it has been renamed to “log_virus”. For both cases I would use a histogram, because the count for each case is easy to read off.
## Loading required package: Matrix
## Loading required package: stats4
## mi (Version 1.0, packaged: 2015-04-16 14:03:10 UTC; goodrich)
## mi Copyright (C) 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015 Trustees of Columbia University
## This program comes with ABSOLUTELY NO WARRANTY.
## This is free software, and you are welcome to redistribute it
## under the General Public License version 2 or later.
## Execute RShowDoc('COPYING') for details.
library(ggplot2)
data(CHAIN)
par(fig=c(0,1,0,1),new=T)
hist(CHAIN$log_virus,
xlab = 'Case',
main = 'Histogram of virus cases with 0s',
ylab = 'Nr.')
library(mi)
library(ggplot2)
data(CHAIN)
par(fig=c(0,1,0,1),new=T)
hist(CHAIN$log_virus[CHAIN$log_virus != 0],
xlab = 'Case',
main = 'Histogram of virus cases without 0s',
ylab = 'Nr.')
A diamond’s weight is given by the “carat” attribute.
I wanted to set the weight of a diamond against its price. Apparently the most expensive diamonds weigh between 1.5 and 3 carats, and some of the cheapest diamonds weigh the least.
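Since the `carat` variable is measured in carats, not grams, a small conversion helps when reading the weights. The sketch below uses the standard definition of one metric carat as 0.2 g; `carat_to_gram` is a hypothetical helper introduced only for illustration.

```r
# Sketch: convert carat weights to grams (1 metric carat = 0.2 g).
# `carat_to_gram` is a hypothetical helper, not part of ggplot2.
carat_to_gram <- function(carat) carat * 0.2

# The weight range mentioned above, expressed in grams
weights_g <- carat_to_gram(c(1.5, 3))
weights_g
```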
data(diamonds, package="ggplot2")
par(fig=c(0,1,0,1),new=T)
hist(diamonds$price,
xlab = 'Price',
main = 'Histogram of Diamonds Prices',
ylab = 'Frequency')
For the distribution of diamond prices, I chose a histogram. It makes it easy to see that the most expensive diamonds are the fewest. One likely factor is that such diamonds are very rare and hard to find. Another is that they require more work than the others.
Figure 5.7
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data(movies, package = "ggplot2movies")
ggplot(movies, aes(votes, rating)) + geom_point() + ylim(1, 10)
(a) Excluding all films with fewer than 100 votes:
library(ggplot2)
library(dplyr)
data(movies, package = "ggplot2movies")
# keep films with at least 100 votes
filtered <- filter(movies, votes >= 100)
ggplot(filtered, aes(votes, rating)) + geom_point() + ylim(1, 10)
(b) Excluding films with an average rating greater than 9 and also those with more than 100000 votes:
library(ggplot2)
library(dplyr)
data(movies, package = "ggplot2movies")
#summary(movies)
# keep films with an average rating of at most 9 and at most 100000 votes
filtered2 <- filter(movies, rating <= 9, votes <= 100000)
ggplot(filtered2, aes(votes, rating)) + geom_point() + ylim(1, 10)
(a) Number of observations in each experimental group (n.e) against the corresponding number of observations in each control group (n.c):
## Loading 'meta' package (version 4.15-1).
## Type 'help(meta)' for a brief overview.
Observations:
- there is a linear relationship between the variables;
- there are 4 outliers at higher values of both variables (2500, 6000, 8500);
- there are several gaps in the dataset, between the outliers mentioned above;
- there seems to be some overplotting at the lower values.
(b) Restricting the scatterplot to only those with fewer than 100 patients in each group:
library(meta)
data(Olkin1995)
ggplot(Olkin1995, aes(n.exp, n.cont)) + geom_point() + ylim(1, 100) + xlim(1,100)
- As there was some overplotting in this range, “zooming” in on that interval gives more insight into 3-4 outliers that were not visible before.
(a) Scatterplot of average revenue per pupil (Revenue) against the corresponding number of pupils (Mem):
Observations:
- there are two outliers at higher values of the Mem variable;
- there are 4 outliers with values close to 0 for the Mem variable and larger values for Revenue;
- the values lie within a certain interval and there may be overplotting, as there are 420 rows in the dataset and only a few points on the scatterplot.
(b) Plotting against log of the number of pupils preserves the order of the observations while making outliers less extreme. So the log transform enhances the visualization.
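The two properties claimed here, order preservation and compression of outliers, can be illustrated with a toy vector. The pupil counts below are made-up values for illustration only, not the zuni data.

```r
# Sketch with made-up pupil counts (hypothetical values, not the zuni data)
mem <- c(60, 150, 400, 900, 2500, 30000)
log_mem <- log10(mem)

# The log is monotone, so the ordering of the observations is unchanged
same_order <- identical(order(mem), order(log_mem))

# Distance of the largest value from the median, before and after the log
spread_raw <- max(mem) / median(mem)
spread_log <- max(log_mem) / median(log_mem)
c(raw = spread_raw, log = spread_log)
```

In a scatterplot the same effect could be obtained by plotting `log10(Mem)` or by using a log-scaled x axis.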
Taking the log of revenue per pupil as well adds no further insight to the scatterplot, as shown below:
(a) Scatterplot of the heights:
There are some outliers, for both higher and lower values of the variables, but it is hard to determine which ones.
(b) Including both points and highest density regions:
## This is hdrcde 3.3
library(hdrcde)
data(father.son, package="UsingR")
par(mar=c(3.1, 4.1, 1.1, 2.1))
with(father.son,hdr.boxplot.2d(fheight, sheight, show.points=TRUE, prob=c(0.01,0.05,0.5,0.75)))
After using a density estimate, it is easier to pick out some outliers outside the contours: 3 for lower values and 2-3 for higher values, all bivariate.
Note: mar= A numeric vector of length 4, which sets the margin sizes in the following order: bottom, left, top, and right. The default is c(5.1, 4.1, 4.1, 2.1).
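Because `par()` changes persist for the rest of the session, it can help to save and restore the previous margins. A minimal sketch, using a throwaway `pdf(NULL)` device so that it also runs non-interactively:

```r
# Sketch: set plot margins temporarily and restore them afterwards.
pdf(NULL)                                  # throwaway device, no file is written
old <- par(mar = c(3.1, 4.1, 1.1, 2.1))    # par() returns the previous settings
new_mar <- par("mar")                      # margins now in effect

# ... plotting code would go here ...

par(old)                                   # restore the saved margins
restored_mar <- par("mar")
dev.off()
```

The order of the four values is bottom, left, top, right, as described in the note above.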
(c) Fitting a linear model to the data and a loess smooth:
data(father.son, package="UsingR")
ggplot(father.son, aes(fheight, sheight)) + geom_point() + geom_smooth(method="lm", colour="red") + geom_abline(slope=1, intercept=0)
## `geom_smooth()` using formula 'y ~ x'
A nonlinear model is not necessary, as the two curves are almost identical:
data(father.son, package="UsingR")
ggplot(father.son, aes(fheight, sheight)) + geom_point() +
geom_smooth(method="lm", colour="red", se=FALSE) +
stat_smooth()
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
A subset of Roberts’ bank sex discrimination dataset from 1979 is available in the package Sleuth2 under the name case1202.
(a) Scatterplot matrix of the three variables: Senior, Age and Exper:
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(GGally)
data(case1202, package="Sleuth2")
#summary(case1202)
par(mar=c(1.1, 1.1, 1.1, 1.1))
#spm(select(case1202, c(4:5,7)), diagonal="histogram", smoother=FALSE, reg.line=FALSE) #groups=bank$Status)
ggpairs(case1202[,c(4:5, 7)], title="Bank discrimination", diag=list(continuous='density'), axisLabels='none')
(b) Scatterplots involving seniority do not have the structure of the scatterplot of experience against age because:
Figure 5.8.
data(Cars93, package="MASS")
#print(Cars93)
ggplot(Cars93, aes(Weight, MPG.city)) + geom_point() +
geom_smooth(colour="green") + ylim(0,50)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Note: fuel economy decreases with weight quite quickly initially and then more slowly.
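The "quickly at first, then more slowly" pattern can be quantified by comparing the fitted drop in mileage over two equal weight intervals. The sketch below fits a loess curve with `stats::loess` at its default settings; the weights 2000, 3000 and 4000 lb are illustrative evaluation points chosen inside the range of the data, not values from the report.

```r
# Sketch: compare the fitted drop in city MPG over two 1000 lb intervals.
cars <- MASS::Cars93
fit  <- loess(MPG.city ~ Weight, data = cars)   # default span and degree

# Evaluation points are illustrative choices within the observed weights
pred <- predict(fit, newdata = data.frame(Weight = c(2000, 3000, 4000)))

drop_light <- unname(pred[1] - pred[2])   # change from 2000 to 3000 lb
drop_heavy <- unname(pred[2] - pred[3])   # change from 3000 to 4000 lb
c(light = drop_light, heavy = drop_heavy)
```

A larger value for `light` than for `heavy` corresponds to the flattening curve described in the note.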
Plotting 1/MPG.city (fuel consumption, which is proportional to litres per 100 km, instead of miles per gallon) against Horsepower:
data(Cars93, package="MASS")
ggplot(Cars93, aes((1/MPG.city), Horsepower)) + geom_point() + geom_smooth(colour="green")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Plotting the two variables against each other gives a roughly linear relationship: the more horsepower a car has, the more fuel it uses, so the worse its gas mileage. But beyond a certain value (about 0.06 on the 1/MPG.city axis, which is in gallons per mile rather than litres per 100 km), cars with less horsepower also need more fuel to gain speed.
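Note that 1/MPG.city is measured in gallons per mile, which is proportional to, but not equal to, litres per 100 km. A minimal conversion sketch follows, assuming US gallons and using the standard definitions of the gallon and the mile; `mpg_to_l100km` is a hypothetical helper introduced for illustration.

```r
# Sketch: convert miles per US gallon to litres per 100 km.
# `mpg_to_l100km` is a hypothetical helper; the constants are the
# standard definitions of the US liquid gallon and the mile.
litres_per_gallon <- 3.785411784
km_per_mile       <- 1.609344
mpg_to_l100km <- function(mpg) 100 * litres_per_gallon / (km_per_mile * mpg)

# 0.06 gallons per mile (i.e. 1 / 0.06 miles per gallon) in litres per 100 km
consumption <- mpg_to_l100km(1 / 0.06)
consumption
```

Under these assumptions, 0.06 gallons per mile comes to roughly 14 litres per 100 km.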
The outliers represent the cars that have a lot more horsepower than the others.
data(Cars93, package="MASS")
# filtered <- filter(Cars93, Horsepower <200 & MPG.city >20)
filtered <- filter(Cars93, Horsepower > 100)
#print(filtered)
The leafshape dataset in the DAAG package includes three measurements on each leaf (length, width, petiole) and the logarithms of the three measurements.
(a) Sploms for the two sets of three variables
library(GGally)
library(ggplot2)
data(leafshape, package="DAAG")
#summary(leafshape)
print(leafshape)
## bladelen petiole bladewid latitude logwid logpet loglen
## 1 33.880000 1.40263200 13.650 5.0 2.6137395 0.3383504716 3.5228249
## 2 33.320000 1.01626000 10.260 5.0 2.3282528 0.0161292219 3.5061578
## 3 29.350000 2.39202500 12.210 5.0 2.5022553 0.8721402875 3.3792925
## 4 26.870000 0.80878700 8.700 5.0 2.1633230 -0.2122196846 3.2910104
## 5 26.670000 0.80276700 8.410 5.0 2.1294215 -0.2196907690 3.2835393
## 6 24.230000 1.49014500 7.700 5.0 2.0412203 0.3988734307 3.1875915
## [rows 7 to 286 omitted]
## arch location
## 1 0 Sabah
## 2 0 Sabah
## 3 0 Sabah
## 4 0 Sabah
## 5 0 Sabah
## 6 0 Sabah
## [rows 7 to 199 omitted]
## 200 0 N Queensland
## 201 0 N Queensland
## 202 0 N Queensland
## 203 0 N Queensland
## 204 0 N Queensland
## 205 0 N Queensland
## 206 0 N Queensland
## 207 0 N Queensland
## 208 0 N Queensland
## 209 0 N Queensland
## 210 0 N Queensland
## 211 0 N Queensland
## 212 0 N Queensland
## 213 0 N Queensland
## 214 0 N Queensland
## 215 0 N Queensland
## 216 0 N Queensland
## 217 0 N Queensland
## 218 0 N Queensland
## 219 0 N Queensland
## 220 0 N Queensland
## 221 0 N Queensland
## 222 0 N Queensland
## 223 0 N Queensland
## 224 0 N Queensland
## 225 0 N Queensland
## 226 0 N Queensland
## 227 1 N Queensland
## 228 1 N Queensland
## 229 1 N Queensland
## 230 1 N Queensland
## 231 1 N Queensland
## 232 1 N Queensland
## 233 1 N Queensland
## 234 1 N Queensland
## 235 1 N Queensland
## 236 1 N Queensland
## 237 1 N Queensland
## 238 1 N Queensland
## 239 1 N Queensland
## 240 1 N Queensland
## 241 1 N Queensland
## 242 1 N Queensland
## 243 1 N Queensland
## 244 1 N Queensland
## 245 1 N Queensland
## 246 1 N Queensland
## 247 0 S Queensland
## 248 0 S Queensland
## 249 0 S Queensland
## 250 0 S Queensland
## 251 0 S Queensland
## 252 0 S Queensland
## 253 0 S Queensland
## 254 0 S Queensland
## 255 0 S Queensland
## 256 0 S Queensland
## 257 0 S Queensland
## 258 0 S Queensland
## 259 0 S Queensland
## 260 1 S Queensland
## 261 1 S Queensland
## 262 1 S Queensland
## 263 1 S Queensland
## 264 1 S Queensland
## 265 1 S Queensland
## 266 1 S Queensland
## 267 1 S Queensland
## 268 1 S Queensland
## 269 1 S Queensland
## 270 1 S Queensland
## 271 1 S Queensland
## 272 1 S Queensland
## 273 1 S Queensland
## 274 1 S Queensland
## 275 1 S Queensland
## 276 1 S Queensland
## 277 1 S Queensland
## 278 0 Tasmania
## 279 0 Tasmania
## 280 1 Tasmania
## 281 1 Tasmania
## 282 1 Tasmania
## 283 1 Tasmania
## 284 1 Tasmania
## 285 1 Tasmania
## 286 1 Tasmania
par(mar=c(1.1, 1.1, 1.1, 1.1))
ggpairs(leafshape[,c(1:3)], title="Standard Leaf measurements", diag=list(continuous='density'), axisLabels='none')
#leafshape$arch <- unlist(leafshape$arch)
ggpairs(leafshape[,c(7:5)], title="Logarithmic Leaf measurements", diag=list(continuous='density'), axisLabels='none')
#, mapping = ggplot2::aes(colour=leafshape[8]), lower = list(continuous = wrap("smooth", alpha = 0.3, size=0.1)))
In the first set of variables there are many points with the same value for the leaf length (bladelen) and width (bladewid). That’s why there is so much overplotting in the first scatterplot.
For the second set of variables, the log transformation preserves the order of the observations while making outliers less extreme, so it improves the visualization.
I consider the second one more useful, as it reveals more structure and reduces the number of plotted points that represent more than one case.
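The effect of the log transform described above can be checked on a small example. This is only a sketch: the vector `x` and the helper `spread` are made up for illustration and are not taken from leafshape.

```r
# Hypothetical right-skewed measurements with one extreme value (not from leafshape)
x <- c(1.2, 1.5, 2.0, 2.4, 3.1, 4.8, 36.5)
log_x <- log(x)

# The ranking of the observations is unchanged by the transformation
identical(order(x), order(log_x))

# Distance of the largest value from the median, measured in units of the IQR
spread <- function(v) (max(v) - median(v)) / IQR(v)
c(raw = spread(x), logged = spread(log_x))
```

The second result is smaller on the log scale, which is why the extreme point no longer compresses the rest of the plot.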
(b) Coloring the cases by the variable arch, describing the leaf architecture:
## Loading required package: carData
## Registered S3 methods overwritten by 'car':
## method from
## influence.merMod lme4
## cooks.distance.influence.merMod lme4
## dfbeta.influence.merMod lme4
## dfbetas.influence.merMod lme4
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:lawstat':
##
## levene.test
data(leafshape, package="DAAG")
#data(bank, package = "gclus")
par(mar = c(1.1, 1.1, 1.1, 1.1))
spm( leafshape[c(1:3)],pch = c(16, 16), diagonal = "histogram",smoother = FALSE, reg.line = FALSE,groups = leafshape$arch )
library(car)
data(leafshape, package="DAAG")
#data(bank, package = "gclus")
par(mar = c(1.1, 1.1, 1.1, 1.1))
spm( leafshape[c(7:5)],pch = c(16, 16), diagonal = "histogram",smoother = FALSE, reg.line = FALSE,groups = leafshape$arch )
By doing this, we can now observe two clusters, defined by the arch variable, in each scatterplot of the matrix.
(a) Scatterplot matrix of the eight continuous variables representing fatty acids:
library(GGally)
library(ggplot2)
data(olive, package="zenplots")
#summary(olive)
ggpairs(olive[,c(3:10)], title="Olive acids", diag=list(continuous='density'), axisLabels='none')
(b) There might be some outliers depending on different variables:
Other observations:
(a) Splom of all continuous variables, except chas:
library(GGally)
library(ggplot2)
data(Boston, package="MASS")
#print(Boston)
ggpairs(Boston[,-c(4)], title="Boston housing", diag=list(continuous='density'), axisLabels='none')
The variable that stands out as positively associated with medv is rm.
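As a rough numerical companion to the splom, the linear association of each variable with medv can be listed directly. This is a sketch that assumes the MASS package is installed; Pearson correlation only captures linear association, so it complements the plot rather than replacing it.

```r
data(Boston, package = "MASS")

# Pearson correlation of every other variable with medv, largest first
cors <- cor(Boston)[, "medv"]
cors <- sort(cors[names(cors) != "medv"], decreasing = TRUE)
round(cors, 2)

# Variables with a positive linear association with medv
names(cors[cors > 0])
```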
(b) Several scatterplots involving the variable crim have an unusual form, because some of the other variables take the same value in most cases. For example, ptratio only takes values within a limited range.
(c)
library('GGally')
# (a)
ggparcoord(data = swiss, columns=c(1:6), scale="uniminmax", alphaLines=0.2) +
xlab("") + ylab("")
(b) There might be some outliers depending on different variables:
(c) The variable Catholic appears to have 2 modes, one at the low end of the range (0 - 0.25) and one at the high end (0.8 - 1.0). So in Switzerland a province usually has either a clear Catholic majority or a clear non-Catholic majority.
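The two modes can also be checked by counting provinces in the bands mentioned above. This is a minimal sketch: the rescaling imitates a uniminmax axis by hand, and the band limits 0.25 and 0.8 are simply the ones read off the plot.

```r
# swiss ships with base R (package datasets)
catholic <- swiss$Catholic

# Rescale to [0, 1] the way a uniminmax axis does
scaled <- (catholic - min(catholic)) / (max(catholic) - min(catholic))

# Count provinces in the low, middle and high parts of the axis
bands <- cut(scaled, breaks = c(0, 0.25, 0.8, 1),
             labels = c("low", "middle", "high"), include.lowest = TRUE)
counts <- table(bands)
counts
```

If the distribution is bimodal as described, the middle band should hold far fewer provinces than either end.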
# (d)
swiss1 <- within(swiss,
catholics_level <- factor(ifelse(Catholic > 80, 'High', 'Low')))
ggparcoord(data = swiss1[order(swiss1$Catholic),], columns=c(1:6), scale="uniminmax",
groupColumn="catholics_level", alphaLines=0.5) +
xlab("") + ylab("") +
theme(legend.position = "none")
(d) The provinces with a high level of Catholic appear to have a higher Fertility index and lower levels of Examination (i.e. % draftees receiving highest mark on army examination) and Education. The Infant.Mortality variable does not seem to be affected much by whether the province has a Catholic or non-Catholic majority.
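The visual impression in (d) can be cross-checked with group medians. This sketch reuses the same Catholic > 80 split as the plot; the names `level`, `vars` and `group_medians` are introduced here only for illustration.

```r
# Same split as in the plot: Catholic > 80 is "High", otherwise "Low"
level <- factor(ifelse(swiss$Catholic > 80, "High", "Low"))

# Median of each variable within the two groups
vars <- c("Fertility", "Examination", "Education", "Infant.Mortality")
group_medians <- aggregate(swiss[, vars],
                           by = list(catholics_level = level),
                           FUN = median)
group_medians
```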
## Loading required package: tools
(a) In this pcp we could see different details:
## [1] 0.53 0.56 0.60 0.63 0.67 0.67 0.67 0.67 0.68 0.72 1.50 1.56 1.62 1.62 1.63
## [16] 1.65 1.67 1.81 1.82 1.83 1.83 1.86 1.92 1.94 1.94 1.99 2.00 2.05 2.06 2.06
## [31] 2.33 3.43 3.47 3.77 3.88 3.94 4.26 4.30 4.52 5.34 5.51 5.64 5.69 5.91 7.23
pottery1 <- within(pottery,
mgo_level <- factor(ifelse(MgO < 1, 'Low', 'High'))) # use the 1.0 threshold here
ggparcoord(data = pottery1[order(pottery1$MgO),], columns=c(1:9), scale="uniminmax",
groupColumn="mgo_level", alphaLines=0.5) +
xlab("") + ylab("") +
theme(legend.position = "none")
(b) The cases with low MgO also have lower Fe2O3, CaO, Na2O, K2O and MnO than the other cases. Some of these cases also have higher values for Al2O3 and TiO2.
# (c)
ggparcoord(data = pottery, columns=c(1:9), scale="uniminmax",
groupColumn="kiln", alphaLines=0.5) +
xlab("") + ylab("") + geom_line(size=0.7)
(c) In this pcp we can see some differences between different kilns as follows:
## pdfCluster 1.0-3
##
## Attaching package: 'pdfCluster'
## The following object is masked from 'package:dplyr':
##
## groups
data("oliveoil")
ggparcoord(data = oliveoil, columns=c(3:10), scale="uniminmax", alphaLines=0.2) +
xlab("") + ylab("")
(a) Some of the observed features in this pcp:
# (b)
ggparcoord(data = oliveoil, columns=c(3:10), scale="uniminmax",
groupColumn="region", alphaLines=0.7) +
xlab("") + ylab("")
(b) In this pcp we can find that:
(c) For the scatterplot matrix down below:
While for a pcp:
ggpairs(oliveoil[,c(3:10)], title="Olive acids", diag=list(continuous='density'), axisLabels='none')
data(Cars93, package="MASS")
col_indices = which(names(Cars93)%in%c('Price', 'MPG.city', 'MPG.highway', 'Horsepower', 'RPM', 'Length', 'Width', 'Turn.circle', 'Weight'))
ggparcoord(data = Cars93, columns=col_indices, scale="uniminmax", alphaLines=0.7) +
xlab("") + ylab("")
(a) In this plot we could conclude the following:
(b) I would plot a pcp (down below), and here we could observe some differences between USA and non-USA cars:
# (b)
ggparcoord(data = Cars93, columns=col_indices, scale="uniminmax",
groupColumn="Origin", alphaLines=0.7) +
xlab("") + ylab("")
(c) Yes, a pcp with uniminmax scaling is informative, since we were able to extract some insights from it in (a) and (b). Below is the same pcp with a standard scale applied (subtracting the mean and dividing by the standard deviation, for each axis) and with its observations grouped by the number of Cylinders. Here we can also see that:
# (c)
ggparcoord(data = Cars93, columns=col_indices,
groupColumn="Cylinders", alphaLines=0.7) +
xlab("") + ylab("") + geom_line(size=0.75)
data(bodyfat, package="MMST")
ggparcoord(data = bodyfat, columns=1:15, scale="uniminmax", alphaLines=0.7) +
xlab("") + ylab("")
(a) There is clearly one outlier which has the maximum value 1.0 on the uniminmax scale in this pcp for many variables (bodyfat, weight, neck, chest, abdomen, hip, thigh, knee, biceps, wrist). Since he is not the tallest man in the sample (judging by his height), I would say this outlier is not an athlete, but rather a person with severe obesity.
In addition, there are 2 more outliers on the ankle axis, who might also be outliers on the hip, abdomen and chest measurements.
(b) The height variable looks like it has many points of concentration, as if it were a categorical variable. A likely reason is that height was recorded with low precision (for example to one decimal) rather than with a higher precision. We can quickly check this below:
# Looking at some of the observations, the values actually appear to be rounded to the nearest quarter (0.00/0.25/0.50/0.75).
bodyfat$height[1:10]
## [1] 67.75 72.25 66.25 72.25 71.25 74.75 69.75 72.50 74.00 73.50
So the reason for this “categorical behaviour” is indeed the low precision with which height was recorded.
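The quarter-step pattern can be tested rather than judged by eye. The sketch below retypes only the ten values printed above into a vector named `heights` (a name introduced here); the same check could be run on the full column.

```r
# The ten height values printed above
heights <- c(67.75, 72.25, 66.25, 72.25, 71.25, 74.75, 69.75, 72.50, 74.00, 73.50)

# TRUE if every value is an exact multiple of 0.25
all(heights %% 0.25 == 0)

# Fractional parts that occur among these values
table(heights %% 1)
```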
(c) As seen in the pcp, as the density increases, the bodyfat decreases, and vice-versa. So there is clearly a negative correlation between those 2 variables. This is also suggested by the intersection of all profiles in (what appears to be) a single point.
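The link between a common crossing point and negative correlation can be illustrated with made-up numbers. In this sketch `a` and `b` are hypothetical values on two adjacent, already rescaled axes (not the bodyfat data), and `b` is constructed as an exact decreasing linear function of `a`.

```r
# Hypothetical scaled values on two adjacent axes (not the bodyfat data)
a <- c(0.05, 0.20, 0.45, 0.70, 0.95)  # left axis, on a 0-1 scale
b <- 1 - a                             # right axis: perfect negative relationship

# Height of each profile line halfway between the two axes
midpoint <- a + 0.5 * (b - a)
midpoint

cor(a, b)
```

Every line passes through the same height at the halfway point, and the correlation is exactly -1. With real data the relationship is not exact, so the crossing only appears to be a single point.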
(d) Yes, the ordering of the variables can affect the pcp display. For this specific dataset we can try ordering the variables by their medians, so that we see all the measurements from highest to lowest and get a better idea of our data.
medians = apply(bodyfat[, 1:15], 2, median, na.rm=TRUE)
ordered_medians_indexes = order(medians)
ggparcoord(data = bodyfat, alphaLines=0.3,
scale="globalminmax", order=ordered_medians_indexes) + coord_flip()
Here we can see, for example, that the wrist is the smallest of all the body measurements and that the chest is the largest part of a man’s body.
(a) I think a good pcp to present to others would be the following, with globalminmax scale and y limits 0-100, where we can deduce that:
## Loading required package: ellipse
##
## Attaching package: 'ellipse'
## The following object is masked from 'package:car':
##
## ellipse
## The following object is masked from 'package:graphics':
##
## pairs
##
## Attaching package: 'SMPracticals'
## The following object is masked from 'package:HSAUR2':
##
## smoking
## The following object is masked from 'package:GGally':
##
## pigs
## The following objects are masked from 'package:MASS':
##
## cement, forbes, leuk, shuttle
ggparcoord(data = mathmarks, columns=1:5, scale="globalminmax", alphaLines=0.7) +
xlab("") + ylab("") + coord_cartesian(ylim=c(0,100))
We could also plot a pcp that highlights the students who scored better on the Vectors exam. In this pcp we can observe that:
mathmarks1 <- within(mathmarks,
vectors_perf <- factor(ifelse(vectors > 60, 'Good', 'Bad')))
ggparcoord(data = mathmarks1, columns=1:5, alphaLines=0.7, scale="globalminmax",
groupColumn='vectors_perf') +
xlab("") + ylab("") + theme(legend.position = "none") + coord_cartesian(ylim=c(0,100))
# (b)
ggparcoord(data = mathmarks, columns=1:5, scale="globalminmax", alphaLines=0.1, boxplot=TRUE) +
xlab("") + ylab("") + coord_cartesian(ylim=c(0,100))
(b) With the boxplots, we can more easily see the maximum, minimum and median values and the outliers for each exam. From this plot alone, we can’t tell for sure that Mechanics and Vectors were closed-book exams. We can say that the Statistics exam (an open-book exam) has the lowest median mark of all the subjects, which is somewhat at odds with it being an open-book exam.
However, the minimum mark on the Mechanics exam (~3 points) is the lowest among all subjects, which is consistent with it having been a closed-book exam.
Regarding the polygonal lines, I think it is not necessary to draw them in a boxplot pcp in this specific scenario, even with the alphaLines option, since they just fill the space without adding useful information, as can be seen above.
(a) The following pcps, which highlight the profiles of one wine class at a time (red - Barbera, green - Barolo, blue - Grignolino), make it easy to see which variables help us classify the different wines:
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
data(wine, package="MMST")
a = ggparcoord(wine, columns=1:13, groupColumn="class",scale="uniminmax") + xlab("") + ylab("") +
theme(legend.position = "none") +
scale_colour_manual(values = c("red","grey", "grey")) + coord_flip()
b = ggparcoord(wine, columns=1:13, groupColumn="class",scale="uniminmax") + xlab("") + ylab("") +
theme(legend.position = "none") +
scale_colour_manual(values = c("grey","green", "grey")) + coord_flip()
c = ggparcoord(wine, columns=1:13, groupColumn="class",scale="uniminmax") + xlab("") + ylab("") +
theme(legend.position = "none") +
scale_colour_manual(values = c("grey","grey", "blue")) + coord_flip()
grid.arrange(a, b, c, nrow=1)
(b) There are certainly some outliers in the data, as shown for example by the extreme high/low values in the blue pcp on the MalicAcid, Ash, Flav and Hue variables.
(c) Yes, there might be some evidence that these classes have subgroups of wines inside them. Some examples would be:
data(Boston, package="MASS")
hcluster = hclust(dist(Boston), method='ward.D2')
clu4 = cutree(hcluster, k=4)
clus = factor(clu4)
boston1 = cbind(Boston, clus)
# (a)
ggparcoord(boston1, columns=1:14, groupColumn="clus", scale="uniminmax") +
xlab("") + ylab("")
# (b)
a = ggparcoord(boston1[which(boston1$clus == 1),], columns=1:14, scale="uniminmax",
mapping=aes(color='#f5aca7')) +
xlab("") + ylab("") +
theme(legend.position = "none") +
scale_colour_manual(values = c("#f5aca7"))
b = ggparcoord(boston1[which(boston1$clus == 2),], columns=1:14, scale="uniminmax",
mapping=aes(color='#a8c75a')) +
xlab("") + ylab("") +
theme(legend.position = "none") +
scale_colour_manual(values = c("#a8c75a"))
c = ggparcoord(boston1[which(boston1$clus == 3),], columns=1:14, scale="uniminmax",
mapping=aes(color='#1ec5c9')) +
xlab("") + ylab("") +
theme(legend.position = "none") +
scale_colour_manual(values = c("#1ec5c9"))
d = ggparcoord(boston1[which(boston1$clus == 4),], columns=1:14, scale="uniminmax",
mapping=aes(color='#cc8afd')) +
xlab("") + ylab("") +
theme(legend.position = "none") +
scale_colour_manual(values = c("#cc8afd"))
grid.arrange(a, b, c, d)
# (c)
a = ggparcoord(rbind(boston1[which(boston1$clus != 1),], boston1[which(boston1$clus == 1),]),
columns=1:14, groupColumn="clus", scale="uniminmax") +
xlab("") + ylab("") +
theme(legend.position = "none") +
scale_colour_manual(values = c("#f5aca7","grey", "grey", "grey"))
b = ggparcoord(rbind(boston1[which(boston1$clus != 2),], boston1[which(boston1$clus == 2),]),
columns=1:14, groupColumn="clus", scale="uniminmax") +
xlab("") + ylab("") +
theme(legend.position = "none") +
scale_colour_manual(values = c("grey","#a8c75a", "grey", "grey"))
c = ggparcoord(rbind(boston1[which(boston1$clus != 3),], boston1[which(boston1$clus == 3),]),
columns=1:14, groupColumn="clus", scale="uniminmax") +
xlab("") + ylab("") +
theme(legend.position = "none") +
scale_colour_manual(values = c("grey","grey", "#1ec5c9", "grey"))
d = ggparcoord(rbind(boston1[which(boston1$clus != 4),], boston1[which(boston1$clus == 4),]),
columns=1:14, groupColumn="clus", scale="uniminmax") +
xlab("") + ylab("") +
theme(legend.position = "none") +
scale_colour_manual(values = c("grey","grey", "grey", "#cc8afd"))
grid.arrange(a, b, c, d)
The first plot could be useful when not enough space is available. However, when 4 clusters must be displayed, it quickly becomes hard to distinguish the different categories, and the plot turns into a messy mix of colors.
The second way, where we plot each cluster individually, looks much cleaner than the others. However, the shapes are not the same as in the other two plots. That’s because plotting the clusters individually changes the minimum and maximum values on each axis, restricting them to the range of that cluster only. So a uniminmax scale is not very helpful for comparing clusters in this layout; a globalminmax scale may be more appropriate for that kind of task.
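The change in axis limits can be reproduced on a toy example. In this sketch the helper `uniminmax`, its `ref` argument and the vector `x` are all made up for illustration; the helper only imitates by hand what a per-variable min-max scale does.

```r
# Min-max rescaling of v, using the min and max of a reference vector
uniminmax <- function(v, ref = v) (v - min(ref)) / (max(ref) - min(ref))

# Hypothetical variable with two clusters (not the Boston data)
x <- c(1, 2, 3, 10, 11, 12)
cluster <- c(1, 1, 1, 2, 2, 2)

# Scaled within cluster 1 only: the cluster fills the whole axis
within_cluster <- uniminmax(x[cluster == 1])
within_cluster

# Scaled against the full data: cluster 1 stays in the lower part of the axis
against_all <- uniminmax(x[cluster == 1], ref = x)
against_all
```

Rescaling within the subset stretches it over the full 0-1 range, so its position relative to the other clusters is lost.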
The third option, which plots each cluster individually with the other ones in the background, looks like a clean way of visualizing the groups and comparing their differences at the same time. It preserves the axis limits, and the outliers are much easier to spot. If I had to choose, this is how I would display my clustering results.